devtools::load_all(".")
library(patchwork)

UK Data

uk_cqr <- readr::read_rds(here::here("data_results", "uk_cqr.rds"))

uk_cqr |> dplyr::count(model)
## # A tibble: 6 x 2
##   model                             n
##   <chr>                         <int>
## 1 epiforecasts-EpiExpert         4784
## 2 epiforecasts-EpiExpert_direct  4784
## 3 epiforecasts-EpiExpert_Rt      4784
## 4 EuroCOVIDhub-baseline          4784
## 5 EuroCOVIDhub-ensemble          4784
## 6 seabbs                         4784

The UK Dataset contains predictions from six different models. Each of these models is associated with exactly two values for every combination of target_type, horizon, quantile and target_end_date (i.e. \(2 \cdot 4 \cdot 23 \cdot 13 = 2392\) combinations): one value for the original prediction and one for the adjusted prediction after applying the CQR method.
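This structure can be sketched as a quick sanity check (assuming the column names shown above; the method column distinguishes the original from the CQR-adjusted value):

```r
# Count the distinct covariate combinations; each should appear twice per
# model (once for "original", once for "cqr").
uk_cqr |>
  dplyr::distinct(target_type, horizon, quantile, target_end_date) |>
  nrow()
# should equal 2 * 4 * 23 * 13 = 2392
```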

Visualizing the raw data

First, one might be interested in any particular covariate combination. The corresponding prediction interval before and after CQR can be plotted with the plot_intervals() function:

plot_intervals(
  uk_cqr,
  model = "seabbs", target_type = "Cases", horizon = 3, quantile = 0.1
)

To visualize the trend along the horizon dimension or for different quantiles for each target_type separately, we can use the plot_intervals_grid() function:

plot_intervals_grid(
  uk_cqr,
  model = "seabbs", quantile = 0.1, facet_by = "horizon"
)

The plots reveal two main findings:

  • CQR seems to make the prediction intervals larger. Since the seabbs forecasts are contributed by a single human, this finding supports the hypothesis that humans tend to be overconfident in their own forecasts, leading to overly narrow prediction intervals.

  • A larger forecast horizon strongly correlates with higher uncertainty and, thus, wider prediction intervals.

Based on this impression, CQR produces adjusted, often wider, interval forecasts. However, we cannot yet say whether the post-processed predictions are actually ‘better’.

Weighted Interval Score and the scoringutils package

We evaluate the quality of forecast intervals with the Weighted Interval Score (WIS). This metric is implemented in the scoringutils package, which provides fine-grained control over the granularity of the evaluation.
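As a brief reminder (a sketch of the definition, not the exact scoringutils implementation), the interval score of a central \((1 - \alpha)\) prediction interval \([l, u]\) for an observation \(y\) is

\[
IS_\alpha(l, u; y) = (u - l) + \frac{2}{\alpha} (l - y) \, \mathbf{1}(y < l) + \frac{2}{\alpha} (y - u) \, \mathbf{1}(y > u),
\]

i.e. the sum of a dispersion term and penalties for under- and overprediction. The weighted interval score averages such scores over a set of quantile levels (plus the absolute error of the median), which is why it can be decomposed into components such as the dispersion reported below.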

Let us first replicate the setting from the first plot and analyze whether CQR actually improved the forecasts. Since we are primarily interested in out-of-sample performance, we filter the data frame down to the validation set:

uk_cqr |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model", "target_type", "horizon", "quantile")) |>
  dplyr::filter(model == "seabbs", target_type == "Cases", horizon == 3, quantile == 0.1) |>
  dplyr::select(method:dispersion)
## # A tibble: 2 x 7
##   method   model  target_type horizon quantile interval_score dispersion
##   <chr>    <chr>  <chr>         <dbl>    <dbl>          <dbl>      <dbl>
## 1 cqr      seabbs Cases             3      0.1           157.       87.5
## 2 original seabbs Cases             3      0.1           211.       38.1

Indeed, CQR leads to a lower interval score by increasing the dispersion (spread), resulting in larger intervals compared to the original forecasts.

Alternatively, we can aggregate over all models, target types, horizons and quantiles to evaluate the overall performance of the CQR method:

uk_cqr |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method")) |>
  dplyr::select(method:dispersion)
## # A tibble: 2 x 3
##   method   interval_score dispersion
##   <chr>             <dbl>      <dbl>
## 1 cqr                62.2       24.1
## 2 original           65.7       12.0

The result shows the same trend: better and wider overall prediction intervals.

Of particular interest might be the question of which models benefitted most from CQR adjustments. This question can, of course, be answered in a similar way to the above by including model in the by argument of summarise_scores().

eval_methods() and plot_eval()

In addition to the absolute interval score change, which can be misleading when models operate on very different scales, the relative or percentage change allows for direct comparisons on a level playing field.

The eval_methods() function displays the relative change of CQR compared to the original predictions. Negative values indicate an improved (lower) interval score, while positive values indicate a larger and therefore worse interval score. Note that the function automatically detects the included post-processing methods from the input data frame, so the string 'method' does not have to be provided. Moreover, all displayed values are computed exclusively from the validation data.
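Conceptually, this relative change corresponds to a computation like the following (a sketch reusing the scoring pipeline from above; the exact internals of eval_methods() may differ):

```r
# Compare CQR and original interval scores per model and compute the
# relative change (negative values = improvement).
uk_cqr |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model")) |>
  dplyr::select(method, model, interval_score) |>
  tidyr::pivot_wider(names_from = method, values_from = interval_score) |>
  dplyr::mutate(relative_change = (cqr - original) / original)
```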

eval_methods(uk_cqr, summarise_by = "model")
## # A tibble: 6 x 2
##   model                         relative_change
##   <chr>                                   <dbl>
## 1 epiforecasts-EpiExpert                -0.0568
## 2 epiforecasts-EpiExpert_direct         -0.0551
## 3 epiforecasts-EpiExpert_Rt             -0.061 
## 4 EuroCOVIDhub-baseline                  0.0088
## 5 EuroCOVIDhub-ensemble                 -0.0205
## 6 seabbs                                -0.0849

The output reveals that CQR led to improved performance for all models except EuroCOVIDhub-baseline. For many categories, the magnitudes of the relative change in the output table may still be difficult to compare, even when sorted in increasing order. Thus, the table can be visualized either with bars (only available for a single specified category, such as model in this case) or with a heatmap (the default):

df_eval <- eval_methods(uk_cqr, summarise_by = "model")

p1 <- plot_eval(df_eval, heatmap = FALSE, base_size = 8) + ggplot2::labs(y = NULL)
p2 <- plot_eval(df_eval, base_size = 8) + ggplot2::labs(y = NULL)

p1 + p2

These two functions are not restricted to a single category (or even a single post-processing method). To illustrate the multi-category case, let’s plot the relative improvements of CQR for each combination of model and quantile, and of model and horizon, respectively:

df_mod_quant <- eval_methods(uk_cqr, summarise_by = c("model", "quantile"))
p1 <- plot_eval(df_mod_quant, base_size = 8) + ggplot2::labs(x = NULL)

df_mod_hor <- eval_methods(uk_cqr, summarise_by = c("model", "horizon"))
p2 <- plot_eval(df_mod_hor, base_size = 8) + ggplot2::labs(y = NULL)

p1 + p2

Interestingly, for the EuroCOVIDhub-baseline model, which predicts constant values, CQR leads to worse interval scores for quantiles in the middle range as well as for extreme horizons, but not the other way around! Overall, however, the trends match our earlier conclusions: extreme quantiles and large horizons benefit most.

It is worth noting that, starting from the plot output, you can always recover the exact numerical values by inspecting the input data frame directly:

df_mod_hor
## # A tibble: 6 x 5
##   model                             `1`     `2`     `3`     `4`
##   <chr>                           <dbl>   <dbl>   <dbl>   <dbl>
## 1 epiforecasts-EpiExpert         0.0039  0.0109 -0.0634 -0.0883
## 2 epiforecasts-EpiExpert_direct -0.0038 -0.0026 -0.0584 -0.0795
## 3 epiforecasts-EpiExpert_Rt      0.0153  0.0153 -0.0596 -0.109 
## 4 EuroCOVIDhub-baseline          0.024  -0.0731  0.0019  0.0647
## 5 EuroCOVIDhub-ensemble          0.0002  0.0049 -0.0289 -0.0301
## 6 seabbs                        -0.0294 -0.0093 -0.102  -0.109

Further arguments to eval_methods()

In case all of the values in a particular row or column of the eval_methods() output have the same sign, it may be useful to group the values into, e.g., below- and above-average improvements. One way to average ratios (here the difference in scores divided by the original score) is to take the geometric mean. This can be done for rows and/or columns via the row_averages and col_averages arguments of eval_methods():

df_mod_tar <- eval_methods(
  uk_cqr,
  summarise_by = c("model", "target_type"),
  row_averages = TRUE, col_averages = TRUE
)
df_mod_tar
## # A tibble: 7 x 4
##   model                           Cases  Deaths average_change
##   <chr>                           <dbl>   <dbl>          <dbl>
## 1 epiforecasts-EpiExpert        -0.0568 -0.0106        -0.034 
## 2 epiforecasts-EpiExpert_direct -0.0551 -0.0131        -0.0343
## 3 epiforecasts-EpiExpert_Rt     -0.0609 -0.0976        -0.0795
## 4 EuroCOVIDhub-baseline          0.0094 -0.129         -0.0622
## 5 EuroCOVIDhub-ensemble         -0.0205 -0.014         -0.0173
## 6 seabbs                        -0.0849 -0.0893        -0.0871
## 7 <NA>                          -0.0453 -0.0601        -0.0527

The last row contains the geometric mean of each column and the last column the geometric mean of each row. Since target_type has only two possible values, the geometric mean in the last column always lies strictly between the two values of the corresponding row.
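These averages can be illustrated with a small hypothetical helper (it assumes the geometric mean is taken over the score ratios \(1 + \text{relative change}\), so that all factors are positive):

```r
# Geometric mean of score ratios, expressed again as a relative change.
geo_mean_change <- function(rel_change) {
  exp(mean(log(1 + rel_change))) - 1
}

# Applied to the Cases and Deaths values of the first row above:
geo_mean_change(c(-0.0568, -0.0106))
# approximately -0.034, in line with the reported average_change
```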

This table can be visualized in the same way as before:

plot_eval(df_mod_tar)

The plot shows that CQR improvements are sometimes stronger for Cases and sometimes for Deaths depending on the model.

Finally, one might analyze both the performance for a combination of two categories and the marginal performance for each single category. This can be done separately:

eval_methods(uk_cqr, summarise_by = "model")
eval_methods(uk_cqr, summarise_by = "horizon")
eval_methods(uk_cqr, summarise_by = c("model", "horizon"))

However, jumping back and forth between three different tables is cumbersome. For that reason, the marginal changes can be added to the two-dimensional table with the margins argument:

eval_mod_hor <- eval_methods(
  uk_cqr,
  summarise_by = c("model", "horizon"), margins = TRUE
)

Due to the large number of entries, a plot is again more convenient:

plot_eval(eval_mod_hor)

While the marginal performance of all models except EuroCOVIDhub-baseline improves with CQR, the combinations of these models with low horizons do not benefit from post-processing.

European Forecast Hub

hub_1 <- readr::read_rds(here::here("data_results", "hub_cqr_1.rds"))
hub_2 <- readr::read_rds(here::here("data_results", "hub_cqr_2.rds"))
hub_cqr <- dplyr::bind_rows(hub_1, hub_2)

For analyzing the impact of CQR on the European Forecast Hub data, we selected a subset of six different models. In contrast to the UK Dataset, we can now analyze performance differences across \(18\) European countries.

hub_cqr |> dplyr::count(model)
## # A tibble: 6 x 2
##   model                        n
##   <chr>                    <int>
## 1 epiforecasts-EpiNow2    192096
## 2 EuroCOVIDhub-baseline   192096
## 3 EuroCOVIDhub-ensemble   192096
## 4 IEM_Health-CovidProject 192096
## 5 ILM-EKF                 192096
## 6 USC-SIkJalpha           192096
hub_cqr |> dplyr::count(location_name)
## # A tibble: 18 x 2
##    location_name      n
##    <chr>          <int>
##  1 Austria        64032
##  2 Bulgaria       64032
##  3 Croatia        64032
##  4 Denmark        64032
##  5 Finland        64032
##  6 Germany        64032
##  7 Greece         64032
##  8 Italy          64032
##  9 Latvia         64032
## 10 Luxembourg     64032
## 11 Malta          64032
## 12 Norway         64032
## 13 Poland         64032
## 14 Portugal       64032
## 15 Romania        64032
## 16 Slovakia       64032
## 17 Slovenia       64032
## 18 United Kingdom 64032

Let us first get a quick impression of the predictions for Germany:

plot_intervals_grid(hub_cqr, model = "EuroCOVIDhub-ensemble", location = "DE", quantile = 0.1)

Here the CQR adjustments are more interesting than for the UK Data: CQR appears to widen the prediction intervals for Cases, whereas it reduces the spread for Deaths.

We can check whether this statement holds when averaged across all models and quantiles:

hub_cqr |>
  dplyr::filter(location_name == "Germany") |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "target_type")) |>
  dplyr::arrange(target_type) |>
  dplyr::select(method:dispersion)
## # A tibble: 4 x 4
##   method   target_type interval_score dispersion
##   <chr>    <chr>                <dbl>      <dbl>
## 1 cqr      Cases               21.1      10.2   
## 2 original Cases               22.3       7.87  
## 3 cqr      Deaths               0.177     0.0923
## 4 original Deaths               0.181     0.0896

While CQR improves the interval score for both target types, the dispersion also increases in both cases, so the previous finding does not generalize.

Next, we consider the effect of CQR for each model and country:

df_mod_loc <- eval_methods(hub_cqr, summarise_by = c("model", "location_name"))
plot_eval(df_mod_loc)

This plot is rather uninformative, since Poland appears to be a huge outlier for all models, but particularly for the IEM_Health-CovidProject predictions. Here, CQR makes the forecast intervals around \(400\)% worse, which is hard to believe.

Let us briefly look into the scores for Poland in more detail by computing the raw scores on the training and validation sets for CQR separately:

cqr_poland <- hub_cqr |>
  dplyr::filter(location_name == "Poland")

cqr_poland |>
  extract_training_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model")) |>
  dplyr::arrange(model) |>
  dplyr::select(method:dispersion)
## # A tibble: 12 x 4
##    method   model                   interval_score dispersion
##    <chr>    <chr>                            <dbl>      <dbl>
##  1 cqr      epiforecasts-EpiNow2              22.8       7.24
##  2 original epiforecasts-EpiNow2              23.7       6.24
##  3 cqr      EuroCOVIDhub-baseline             31.7       9.88
##  4 original EuroCOVIDhub-baseline             35.8       4.01
##  5 cqr      EuroCOVIDhub-ensemble             29.4       9.75
##  6 original EuroCOVIDhub-ensemble             31.4       7.54
##  7 cqr      IEM_Health-CovidProject           56.5      17.9 
##  8 original IEM_Health-CovidProject           62.0      12.0 
##  9 cqr      ILM-EKF                           37.8      12.5 
## 10 original ILM-EKF                           38.5      11.4 
## 11 cqr      USC-SIkJalpha                     27.4      10.9 
## 12 original USC-SIkJalpha                     30.6       8.06

cqr_poland |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model")) |>
  dplyr::arrange(model) |>
  dplyr::select(method:dispersion)
## # A tibble: 12 x 4
##    method   model                   interval_score dispersion
##    <chr>    <chr>                            <dbl>      <dbl>
##  1 cqr      epiforecasts-EpiNow2             1.94      1.21  
##  2 original epiforecasts-EpiNow2             1.32      0.418 
##  3 cqr      EuroCOVIDhub-baseline            6.46      5.73  
##  4 original EuroCOVIDhub-baseline            3.34      2.69  
##  5 cqr      EuroCOVIDhub-ensemble            2.34      1.87  
##  6 original EuroCOVIDhub-ensemble            0.964     0.355 
##  7 cqr      IEM_Health-CovidProject          4.68      4.22  
##  8 original IEM_Health-CovidProject          0.908     0.222 
##  9 cqr      ILM-EKF                          1.71      1.35  
## 10 original ILM-EKF                          0.862     0.501 
## 11 cqr      USC-SIkJalpha                    2.77      1.89  
## 12 original USC-SIkJalpha                    1.65      0.0410

The discrepancy is striking! This indicates a sudden change in the COVID situation in Poland right around the point where we split the data. We compare the development of COVID Cases between Poland and Germany:

hub_cqr |>
  dplyr::filter(location %in% c("PL", "DE"), method == "original", model == "IEM_Health-CovidProject", target_type == "Cases", quantile == 0.05, horizon == 1) |>
  dplyr::select(location_name, forecast_date, true_value) |>
  tidyr::pivot_wider(names_from = location_name, values_from = true_value) |>
  print(n = Inf)
## # A tibble: 29 x 3
##    forecast_date Germany Poland
##    <date>          <dbl>  <dbl>
##  1 2021-03-15     120.   388.  
##  2 2021-03-22     137.   487.  
##  3 2021-03-29     125.   510.  
##  4 2021-04-05     149.   362.  
##  5 2021-04-12     175.   324.  
##  6 2021-04-19     166.   199.  
##  7 2021-04-26     160.   124.  
##  8 2021-05-03     129.    80.5 
##  9 2021-05-10      78.6   59.8 
## 10 2021-05-17      69.1   33.3 
## 11 2021-05-24      38.3   18.0 
## 12 2021-05-31      26.9    9.09
## 13 2021-06-07      18.6    6.37
## 14 2021-06-14       8.83   3.66
## 15 2021-06-21       5.51   2.46
## 16 2021-06-28       4.72   1.70
## 17 2021-07-05       6.17   1.42
## 18 2021-07-12       9.78   1.58
## 19 2021-07-19      12.8    1.87
## 20 2021-07-26      18.0    2.30
## 21 2021-08-02      22.8    2.73
## 22 2021-08-09      35.0    3.18
## 23 2021-08-16      55.1    3.50
## 24 2021-08-23      76.8    3.99
## 25 2021-08-30      98.3    5.62
## 26 2021-09-06      79.0    7.93
## 27 2021-09-13      77.4   11.1 
## 28 2021-09-20      66.3   13.7 
## 29 2021-09-27      68.3   18.9

p1 <- cqr_poland |>
  plot_intervals(model = "IEM_Health-CovidProject", target_type = "Cases")

p2 <- hub_cqr |>
  dplyr::filter(location == "DE") |>
  plot_intervals(model = "IEM_Health-CovidProject", target_type = "Cases")

p1 + p2

In both countries the incidence dropped drastically during the summer months of 2021. While the number of Cases increased again in autumn in Germany, the incidence in Poland stayed very low (at least in the dataset). Thus the distribution of Cases during CQR training looks very different from the distribution during inference.

Since the plot shows 1 week ahead forecasts, the original predictions could be adapted quite quickly based on human knowledge about the situation at that time. CQR, in contrast, adjusts only slowly to the distribution shift.

We exclude Poland from the further analysis to obtain more meaningful visual illustrations for the remaining countries. Let’s display the same picture as before without Poland:

cqr_no_poland <- hub_cqr |>
  dplyr::filter(location_name != "Poland")

df_mod_loc <- eval_methods(cqr_no_poland, summarise_by = c("model", "location_name"))
plot_eval(df_mod_loc)

For all other countries, CQR shows the largest improvements for the USC-SIkJalpha model, which contains predictions from a research group at the University of Southern California.

For the UK Data we could improve forecasts most when considering large forecast horizons, quantiles in the tails of the predictive distribution, and Cases instead of Deaths. We now investigate whether these effects persist for the European Forecast Hub, this time stratified by country:

df_hor_loc <- eval_methods(cqr_no_poland, summarise_by = c("horizon", "location_name"))
plot_eval(df_hor_loc)


df_quant_loc <- eval_methods(cqr_no_poland, summarise_by = c("quantile", "location_name"))
plot_eval(df_quant_loc)


df_target_loc <- eval_methods(cqr_no_poland, summarise_by = c("target_type", "location_name"))
plot_eval(df_target_loc)

Indeed, the same trends are identifiable, although to a much lesser degree.

After removing Poland from the data, Croatia now seems to be an outlier: CQR does not improve the interval score in many scenarios.

hub_cqr |>
  dplyr::filter(location_name == "Croatia") |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model", "target_type", "quantile", "horizon")) |>
  dplyr::filter(target_type == "Cases", horizon == 2, quantile == 0.2) |>
  dplyr::arrange(model) |>
  dplyr::select(method:dispersion)
## # A tibble: 12 x 7
##    method  model          target_type quantile horizon interval_score dispersion
##    <chr>   <chr>          <chr>          <dbl>   <dbl>          <dbl>      <dbl>
##  1 cqr     epiforecasts-… Cases            0.2       2           19.2      12.5 
##  2 origin… epiforecasts-… Cases            0.2       2           19.3      11.6 
##  3 cqr     EuroCOVIDhub-… Cases            0.2       2           29.3      27.5 
##  4 origin… EuroCOVIDhub-… Cases            0.2       2           20.3      12.0 
##  5 cqr     EuroCOVIDhub-… Cases            0.2       2           12.3      10.3 
##  6 origin… EuroCOVIDhub-… Cases            0.2       2           11.5       8.97
##  7 cqr     IEM_Health-Co… Cases            0.2       2           15.0      11.8 
##  8 origin… IEM_Health-Co… Cases            0.2       2           13.8       8.35
##  9 cqr     ILM-EKF        Cases            0.2       2           21.4      18.1 
## 10 origin… ILM-EKF        Cases            0.2       2           20.8      14.1 
## 11 cqr     USC-SIkJalpha  Cases            0.2       2           20.7      12.9 
## 12 origin… USC-SIkJalpha  Cases            0.2       2           19.6       4.15

In this case, however, the performance drop is caused primarily by the EuroCOVIDhub-baseline model, for which CQR increases the interval width by a large amount. For this specific covariate combination, the new adjusted intervals contain the true value in only a few more situations than the original predictions, at the cost of much lower precision:

hub_cqr |>
  dplyr::filter(location_name == "Croatia") |>
  plot_intervals(model = "EuroCOVIDhub-baseline", target_type = "Cases", horizon = 2, quantile = 0.2)